Project 2: Pattern Matching in Compressed DNA Sequence

Author

Barna Saha

Abstract

Space-efficient storage of large genome sequences requires good compression techniques. However, if these sequences must be decompressed before any processing can be done on them, the advantage of compression is lost. New techniques are required to extend traditional pattern matching algorithms to work directly on the compressed sequence. This saves memory, requires fewer disk accesses, and results in a large speedup. In this project we will explore one such pattern matching algorithm for compressed DNA sequences, known as the Derivative Boyer-Moore algorithm [2]. We will compare its running time with that of the traditional exact string matching algorithm, the Boyer-Moore algorithm [6]; the fastest known exact string matching algorithm, AGREP [3]; and LZgrep, another algorithm that searches directly on the compressed sequence.

1 Project Description

Perhaps one of the most recurrent subproblems in almost every application of computer science is the need to find the occurrences of a pattern string inside a large text. The problem is especially important in computational biology, where large DNA sequences are searched for matching patterns. Each DNA sequence is written over an alphabet of only four letters, A, C, T, and G, but the sequences are generally very long and contain a vast amount of information. The human genome, for instance, contains three billion characters over twenty-three pairs of chromosomes. Pattern matching over such long DNA sequences requires algorithms that handle the sequences efficiently and search them very quickly.

To save storage space, it is natural to store DNA sequences in compressed form in secondary storage. However, decompressing them in main memory before searching may result in memory overflow, multiple disk accesses, and slower running time. To overcome these adverse effects, there has recently been a surge of interest in designing pattern matching algorithms that look for exact occurrences of a pattern in a compressed DNA sequence without first decompressing it. The technique allows reduction
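
As a rough illustration of what searching without decompression can look like, the sketch below packs each base into 2 bits and compares bit windows directly. It is not the Derivative Boyer-Moore algorithm, AGREP, or LZgrep; it is a naive scan. The 2-bit code table and the names `CODE`, `pack`, and `find_packed` are assumptions made for this example and do not come from the project description.

```python
from __future__ import annotations

# Hypothetical 2-bit code for the four bases (an assumption for illustration).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq: str) -> int:
    """Pack a DNA string into one integer, 2 bits per base, first base in the highest bits."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def find_packed(text_bits: int, n: int, pat_bits: int, m: int) -> list[int]:
    """Return every start index where the packed pattern (m bases) occurs in the
    packed text (n bases). Only 2-bit windows are compared; the text is never
    expanded back into characters."""
    mask = (1 << (2 * m)) - 1  # keeps the lowest 2*m bits
    hits = []
    for i in range(n - m + 1):
        shift = 2 * (n - m - i)  # brings the window starting at base i to the low bits
        if (text_bits >> shift) & mask == pat_bits:
            hits.append(i)
    return hits


text = "ACGTACGTTACG"
pattern = "ACG"
print(find_packed(pack(text), len(text), pack(pattern), len(pattern)))
```

The sequence lengths are passed explicitly because leading `A` bases encode as zero bits and would otherwise be lost in the integer. This naive scan checks every alignment; a Boyer-Moore style method would instead skip alignments using precomputed shift tables.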


Related resources

DNA Compressed and Sequence Searching on Multicore

One of the uses of string matching is searching for a DNA sequence in a DNA database. This simple operation can take hours or days because of the huge size of DNA sequence databases. On the other hand, the potential of multicore processors for DNA sequence searching has not been fully explored, owing to the difficulty of multicore programming. This paper evaluates several key string matching algorithms using a comp...


The Complexity of Two-Dimensional Compressed Pattern-

We consider the complexity of problems for highly compressed 2-dimensional texts: compressed pattern-matching (when the pattern is not compressed and the text is compressed) and fully compressed pattern-matching (when the pattern is also compressed). First we consider 2-dimensional compression in terms of straight-line programs, see [9]. It is a natural way for representing very highly compresse...


An efficient pattern matching scheme in LZW compressed sequences

Compressed pattern matching (CPM) is an emerging research field addressing the following problem: given a compressed sequence and a pattern, process the sequence with minimal (or no) decompression to find the occurrence(s) of the pattern in the uncompressed sequence. It can be applied to detect malware and leakage of confidential information directly in compressed files. In this paper, we report our work of CPM i...


Fast search in DNA sequence databases using punctuation and indexing

Exact pattern searching in DNA sequence databases has applications in identification of highly conserved regulatory sequences, the design of hybridization probes, and improving performance of approximate homology searching tools such as BLAST and BLAT. We propose a new pattern searching algorithm, CompressedPunctuated-Boyer-Moore (cp-BM), to enhance exact pattern match searches of DNA sequences...


A Fully Compressed Pattern Matching Algorithm for Simple Collage Systems

We study the fully compressed pattern matching problem (FCPM problem): Given T and P which are descriptions of text T and pattern P respectively, find the occurrences of P in T without decompressing T or P. This problem is rather challenging since patterns are also given in a compressed form. In this paper we present an FCPM algorithm for simple collage systems. Collage systems are a general fr...



Publication date: 2008
